Whilst base R plots are quick and useful for examining
our data, they don’t always offer the flexibility and attractive
customization options that we’d like for a presentation or manuscript.
This is where ggplot2 comes in.
This session will teach you the basics of using ggplot2
to visualize data in R. ggplot was developed in 2005 by Hadley Wickham
as an open source data visualization package for R. With
ggplot2, you can create plots that range from simple
scatter diagrams to complex custom plots that are (almost) completely
customizable.
Resources:
https://towardsdatascience.com/guide-to-data-visualization-with-ggplot2-in-a-hour-634c7e3bc9dd
https://ggplot2.tidyverse.org/reference/
https://www.youtube.com/@Riffomonas
Install necessary packages
ggplot2 is included in the tidyverse
package, so installing and loading tidyverse covers almost
everything in this session. The code below also installs ggtext,
which adds Markdown and HTML text rendering to ggplot2 labels.
packages_to_install <- c("tidyverse","ggtext")
for (package in packages_to_install) {
if (!(package %in% rownames(installed.packages()))) {
install.packages(package, dependencies = TRUE)
print(paste("Installed package:", package))
} else {
print(paste(package, "is already installed"))
}
}
[1] "tidyverse is already installed"
[1] "ggtext is already installed"
library(tidyverse)
library(ggtext)
Understanding ggplot syntax
There are three fundamental elements that go into constructing a plot
with ggplot2:
Data - dataframe to be plotted
data = dataframe
Aesthetics - maps variables to elements of the plot
(i.e. x axis, y axis, color scheme, etc.)
mapping = aes()
Geometry/Layers - visual elements used for the data
+ geom_function()
The typical input code for ggplot will usually look something like
this:
ggplot(data = df, # input data
mapping = aes(x = var1, # input mapping aesthetics
y = var2,
color = var3)) +
geom_point() # add plotting layer
Setting up data for success
One of the most important parts of getting ready to plot data in
R is ensuring that your data are “tidy”. When passing
instructions to ggplot2, the program interprets dataframes
in a fixed way:
columns are variables
rows are observations
Let’s examine a dataframe to better understand how
ggplot2 interprets data. Here, we will be using the Iris
dataset. The Iris dataset is built-in to R and was
introduced by British statistician and biologist Ronald A. Fisher in
1936. The measurements themselves were collected by botanist Edgar
Anderson to study the variation in iris flowers of three different
species: Iris setosa, Iris versicolor, and Iris virginica.
head(iris) # view the first six rows of the dataframe
Looking at the first rows of this dataframe we can see that each
variable is contained in a column and each row is an
observation. This means that if you have replicate
measurements (as in this dataset, there are multiple measurements of
each variable per species) you will need to have a row per
replicate rather than storing the replicate data in columns.
Other important notes about this dataset:
it consists of four numeric variables (Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width) and one categorical variable (Species). This
structured format makes it easy to map variables to aesthetics in
ggplot2.
the Iris dataset has a balanced class distribution. Each of the
three species (setosa, versicolor, virginica) has an equal number of
observations. This balance allows for fair visual comparisons and avoids
potential biases that can arise from imbalanced datasets.
column names and features contain no spaces or “-”.
R doesn’t usually like these.
Let’s build a plot!
ggplot2 builds plots in layers. You can start with a
layer showing the raw data, then keep adding elements until you have
the graph you want. Building up step by step makes it easier to close
the gap between the plot in your head and the plot on screen.
ggplot(data = iris, # use 'data' argument to tell ggplot which dataframe we want to plot from
mapping = aes(x = Petal.Length, # mapping determines which variables are assigned to plot elements
y = Petal.Width)) -> basicPlot
basicPlot

We have told ggplot2 which variables we want to plot on
the x and y axes but we have not told ggplot2 which
geometric elements (i.e. geoms) to use to construct the plot, so all we
have are the axes and a blank plot.
What are geoms?
Geoms are the geometric objects (e.g. lines, bars, etc.) that
determine how observations are rendered. Layering elements in a plot
usually starts with adding geoms. Let’s add geom_point() to
our basic plot to create a scatter plot:
ggplot(data = iris,
aes(x = Petal.Length, # it is very common to see 'mapping =' omitted from the code - ggplot will accept either
y = Petal.Width)) + # use a + to add elements to your plot
geom_point() -> basicScatter
basicScatter

Another way you can add layers to a plot is by simply adding them to
the end of the object that we assigned our first plot to. It is very
common to see this in online guides and forums (such as Stack Overflow)
where you might look for help with R coding:
basicPlot +
geom_point()

Although this generates the same output, I would generally avoid
making your plots this way - if you end up with something that isn’t
quite working as expected I find it can be easier to fix if all your
code is laid out in front of you, rather than having to revisit each
individual step in the process of making your plot.
Now we’ll start to add some more elements to our mapping aesthetics
to better illustrate our data.
IMPORTANT NOTE
The initial mapping that you specify in the ggplot()
call (i.e. axes, color, size, etc.) is by default used globally for
the plot and is carried over to any geoms you add in the following
code. Each geom can also have its own separate mapping aesthetics,
which can allow you to create more complex plots. If you ever run into
issues where a geom is not behaving as you would expect, take a look
back through your code and check where your aesthetics were assigned,
and how they apply to the geoms you are trying to layer.
Let’s color our points by Species:
ggplot(data=iris,
mapping = aes(x=Petal.Length,
y=Petal.Width,
color = Species)) + # tell ggplot that color is determined by Species variable
geom_point() -> colorScatter
colorScatter

Now we can add a simple regression using geom_smooth()
and demonstrate how changing global vs. specific aesthetics
affects geoms. Some geoms have specialized arguments that control how
they work. In this case, geom_smooth() lets us tell it
which method to use to generate the curve that it will plot. We will opt
for lm, which fits a linear model; the shaded band around each line is
the confidence interval that geom_smooth() draws by default.
ggplot(data=iris,
mapping = aes(x=Petal.Length,
y=Petal.Width,
color = Species)) +
geom_point() +
geom_smooth(method = lm) -> colorScatterlm
colorScatterlm

This creates three separate curves that map to the points from each
Species by color, as this is what is specified in the global aesthetics.
Let’s change that and plot a curve that spans all points:
ggplot(data=iris,
mapping = aes(x=Petal.Length,
y=Petal.Width)) + # remove color from global aesthetics
geom_point(aes(color = Species)) + # set geom_point aesthetics - this will only color points
geom_smooth(method = lm) -> colorScatterlm2
colorScatterlm2

Now we can see that the points are still colored by Species, but the
regression is not.
Let’s play with some more aesthetics:
ggplot(data=iris,
mapping = aes(x=Petal.Length,
y=Petal.Width)) +
geom_point(aes(color = "blue")) +
geom_smooth(method = lm) -> basicPlotBlue
basicPlotBlue

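Notice that even though we set the color aesthetic of the points to
“blue”, the points are not blue. Anything inside aes() is treated as
data, so ggplot2 reads “blue” as a single category, gives it the first
color of the default palette, and adds a legend entry for it. If we want
all the points to share one color (or shape, or size), that value is
not set through aesthetics, as it does not depend on a
variable.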
Let’s make our points blue and change their shape:
ggplot(data = iris,
mapping = aes(x = Petal.Length,
y = Petal.Width)) +
geom_point(color = "blue", # note that these options are set outside aes()
size = 5,
shape = 1) -> openCircleBlue
openCircleBlue

# shapes are defined by a numerical value
# available shapes can be viewed at https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/ or by using ggpubr::show_point_shapes()
There are other ways we can control how each layer is rendered. Let’s
start with controlling “scales”. Scales allow us to edit specific
elements of the aesthetics and are named in a uniform manner that
describes how they act and what they affect. The names are made up of
three pieces separated by “_”:
scale
the name of the aesthetic (e.g. color, shape, size, etc.)
the name of the scale (e.g. manual, continuous, discrete, etc.)
Within the scales there are several options we can edit:
name = this controls what the variable is called within
the plot or legend
values = this allows us to manually input the values
(i.e. colors or shapes) used in the plot
labels = this controls the data labels (i.e. species
names) in the plot or legend
Let’s use scale_color_manual to manually select some
colors for our plot.
(There are set named colors that can be used in R https://stat.columbia.edu/~tzheng/files/Rcolor.pdf but
you can also use hex codes. Make sure to use color blind friendly color
schemes for figures you plan to present or publish!)
ggplot(data = iris,
mapping = aes(x = Petal.Length,
y = Petal.Width,
color = Species)) +
geom_point() +
geom_smooth(method = lm) +
scale_color_manual(name = "Iris species",
values = c("setosa" = "pink",
"versicolor" = "plum",
"virginica"="seagreen3"),
labels = c("Iris setosa",
"Iris versicolor",
"Iris virginica")) -> multiColor
multiColor

Now we have some different colors and data labels in our plot of the
Iris data. The name and labels options in the
scale are useful for changing how the data are labeled in
your plot without needing to manipulate the raw data.
We can also use continuous color scales to visually represent changes
in values:
ggplot(data = iris,
mapping = aes(x = Petal.Length,
y = Petal.Width)) +
geom_point(aes(color = Petal.Length)) +
geom_smooth(method = lm) -> blueCont
blueCont

You can also use other variables within the dataframe to control the
aesthetics of the plot.
ggplot(data = iris,
mapping = aes(x = Petal.Length,
y = Petal.Width)) +
geom_point(aes(color = Sepal.Length, # color is dependent on sepal length
size = Sepal.Width)) + # point size is dependent on sepal width
geom_smooth(method = lm,
color = "black", # change the color of the curve
se = FALSE) + # hide the confidence band around the curve
scale_color_gradient(high = "purple",
low = "orange") # manually set the colors of the gradient scale

Obviously there is a little too much data now contained in this plot
for it to be particularly useful, but it is a good example of how much
data you can display and the different ways you can present it using
R.
Let’s tidy up our plot and make something that looks a little more
“publication-ready”:
First, let’s assign our chosen color scale to a vector object so we
can call the same colors for any future plots without needing to write
out the code every time. This time I’m going to use a color blind
friendly palette generated using this tool: https://davidmathlogic.com/colorblind.
plotColors <- c("setosa" = "#648FFF",
"versicolor" = "#DC267F",
"virginica"="#FFB000")
ggplot(data = iris,
mapping = aes(x = Petal.Length,
y = Petal.Width)) +
geom_point(aes(color = Species)) +
geom_smooth(method = lm,
color = "black") +
theme_bw() + # this is a built-in theme that removes the gray plot background
scale_color_manual(values = plotColors) + # direct ggplot to our color vector
ylab("Petal width (cm)") + # change y axis label (iris measurements are in cm) - can also be done with scales
labs(x = "Petal length (cm)",
color = "Iris species") + # change legend title - can also be done with scales as previously
ggtitle("Petal width by petal length per species") + # add plot title
theme(plot.title = element_text(hjust = 0.5)) -> multiColorTidy # center plot title
multiColorTidy

Perfect! Now we can write our plot to a pdf:
pdf("multi.color.iris.plot.lm.pdf", # file name to write to
height = 4, # plot height in inches
width = 6) # plot width in inches
multiColorTidy # tell R which plot to write to file
dev.off() # this tells R that you're done creating a file
null device
1
Or we can use ggsave(), a ggplot2 function that can save
a plot in a range of other graphics file formats:
ggsave(plot = multiColorTidy, # specify plot
"multi.color.iris.plot.lm.tiff", # specify file name
height = 4, # plot height
width = 6, # plot width
units = c("in"), # specify which units to use for height and width
device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name
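As a further illustration of ggsave(), here is a minimal sketch that saves a raster image at a chosen resolution. The plot object examplePlot and the file name are placeholders invented for this example, and the file is written to a temporary folder so nothing in your working directory is touched.

```r
library(ggplot2)

# small stand-in plot so the example runs on its own
examplePlot <- ggplot(data = iris,
                      aes(x = Petal.Length,
                          y = Petal.Width)) +
  geom_point()

# hypothetical file name inside R's temporary folder
outFile <- file.path(tempdir(), "example.iris.plot.png")

ggsave(filename = outFile, # naming the arguments avoids relying on their order
       plot = examplePlot,
       height = 4,
       width = 6,
       units = "in",
       dpi = 300) # resolution in dots per inch, relevant for raster formats

file.exists(outFile) # confirm that the file was written
```

Naming both filename and plot is optional, but it makes the call easier to read when you come back to it later.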
Plotting group means and error bars
Another way that we may want to plot our data is by plotting both
group means and individual data points. This can help people
better visualize the spread of our data. This is easy enough with geoms
like geom_boxplot() and geom_violin() that
have group metrics built into their functionality.
ggplot(data = iris,
aes(x = Species,
y = Petal.Length,
fill = Species)) +
geom_boxplot(outliers = FALSE) +
geom_point()

You can see that geom_boxplot() has automatically
generated a box that spans the 25th to 75th percentiles with a thick
line at the median, and whiskers that extend to the furthest value no
more than 1.5 × the IQR from the box. Values beyond the whiskers are
treated as outliers and are normally drawn as separate points; here we
set outliers = FALSE because geom_point() already draws every
observation. This is great for taking a quick summary look at your data.
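To see the numbers behind each box, the same summaries can be calculated by hand. This is only a sketch for checking the plot, and the object name boxStats and its column names are made up for the example; the values should closely match what geom_boxplot() draws.

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(q25 = quantile(Petal.Length, 0.25), # lower edge of the box
            median = median(Petal.Length),      # thick line
            q75 = quantile(Petal.Length, 0.75), # upper edge of the box
            iqr = IQR(Petal.Length)) -> boxStats # interquartile range (q75 - q25)

boxStats
```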
But what if we want a bar plot of group means? Bar geoms such as
geom_bar() and geom_col() do not calculate means or error bars for us.
There are a couple of ways we could solve this issue using functions
in the tidyverse package. First, we could create a summary
table that contains grouped information for our data using
summarise.
iris %>%
group_by(Species) %>% # group_by tells R which variable to use to group observations
summarise(mean.Petal.Length = mean(Petal.Length), # add a column containing mean values per species
standard.deviation = sd(Petal.Length)) -> irisSummary # add a column containing standard deviation
head(irisSummary)
We can use the summarise function to create a new
dataframe that contains a mean and standard deviation for each species.
We can write this to a new object and then use this for plotting by
providing each geom with a different dataframe.
ggplot() + # we do not want global mapping or data for this plot so none is put in the ggplot call
geom_col(data = irisSummary, # set the dataframe for the columns
aes(x = Species,
y = mean.Petal.Length,
fill = Species),
alpha = 0.5) +
geom_errorbar(data = irisSummary, # set the dataframe for the error bars
aes(x = Species,
ymin = (mean.Petal.Length - standard.deviation), # set the minimum error bar value
ymax = (mean.Petal.Length + standard.deviation)), # set the maximum error bar value
width = 0.2) +
geom_jitter(data = iris, # set the dataframe for the points
aes(x = Species,
y = Petal.Length,
color = Species),
width = 0.2, # make the total spread of the points narrower
shape = 1) # set the shape to open circle

Now we can see both the mean and individual values on our bar
plot.
Another, more streamlined, way of doing this is using
stat_summary, where we remove the need to create a separate
dataframe by using functions within the ggplot2 package.
ggplot(data = iris,
aes(x = Species,
y = Petal.Length)) +
stat_summary(geom = "col", # identify which geom we want
fun.data = mean_se, # tell stat_summary which function to apply to summarise the data
aes(fill = Species), # set aesthetics as normal
alpha = 0.5) +
stat_summary(geom = "errorbar",
fun.data = mean_se,
color = "black",
width = 0.2) +
geom_jitter(aes(color = Species),
shape = 1,
width = 0.2)

Voilà! We have almost the same plot as above but with a step
removed. However, you may have noticed that we used the function
mean_se, which calculates the mean and standard error for a
vector of y values at each unique x value
(i.e. the function receives a vector of values for Petal.Length
for each Species), and most of the time we like to use standard
deviation. ggplot2 does ship mean_sdl, but it relies on the Hmisc
package and defaults to two standard deviations either side of the
mean - so a simple option is to create our own.
mean.sd <- function(x){
tibble(y = mean(x), # tell the function that we want a tibble output (similar to dataframe)
ymin = y - sd(x), # calculates the minimum value for error bar
ymax = y + sd(x)) # calculates the maximum value for error bar
}
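As a quick check of what these summary functions hand to each layer, mean_se() can be called directly on a vector. The sketch below uses the setosa petal lengths and compares the result with bounds calculated by hand; the object names (setosa, seSummary, se, sdBounds) are made up for the example.

```r
library(ggplot2)

setosa <- iris$Petal.Length[iris$Species == "setosa"]

seSummary <- mean_se(setosa) # data frame with columns y, ymin, ymax (mean +/- 1 standard error)
seSummary

se <- sd(setosa) / sqrt(length(setosa)) # standard error calculated by hand
c(mean(setosa) - se, mean(setosa) + se) # should match ymin and ymax above

sdBounds <- c(mean(setosa) - sd(setosa), mean(setosa) + sd(setosa)) # the wider mean +/- 1 SD range
sdBounds
```

Because the standard error shrinks as the number of observations grows, error bars based on it are always narrower than bars based on the standard deviation for the same data.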
Now we can create our plot:
ggplot(data = iris,
aes(x = Species,
y = Petal.Length)) +
stat_summary(geom = "col",
fun.data = mean.sd,
aes(fill = Species),
alpha = 0.5) +
stat_summary(geom = "errorbar",
fun.data = mean.sd,
color = "black",
width = 0.2) +
geom_jitter(aes(color = Species),
shape = 1,
width = 0.2)

Facets
Faceting is a technique that allows us to separate data out into
panels based on a variable in the dataframe. This is useful for
visualizing complex data where it may be easier to see patterns when the
data are separated.
There are two methods to create facets in a plot:
facet_wrap() and facet_grid(). facet_wrap() lays the panels out in a
ribbon that wraps onto new rows, which suits facets based on one
variable (e.g. species). If you want to arrange facets by two variables
(e.g. species and time point), facet_grid() places one variable on the
rows and the other on the columns.
Let’s pull up another of R’s built-in datasets (mtcars)
that will allow us to see both of these in action. mtcars is built from
data extracted from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32
automobiles (1973–74 models).
head(mtcars)
Let’s look at mpg (Miles per US Gallon) plotted against hp (Gross
horsepower).
ggplot(data = mtcars,
aes(x = hp,
y = mpg,
color = mpg)) +
geom_point(size = 3)

Now let’s use facet_wrap() to split these data up by vs
(Engine shape, 0 = V, 1 = straight).
ggplot(data = mtcars,
aes(x = hp,
y = mpg,
color = mpg)) +
geom_point(size = 3) +
facet_wrap(~ vs)

Let’s add another variable facet with facet_grid() and
split the data by am (Transmission, 0 = automatic, 1 = manual) as
well.
ggplot(data = mtcars,
aes(x = hp,
y = mpg,
color = mpg)) +
geom_point(size = 3) +
facet_grid(cols = vars(vs), # assign a variable to the column panels
rows = vars(am)) # assign a variable to the row panels

We can see that there are different correlations between hp and mpg
depending on the other qualities of the car. However, this plot is now
difficult to read because both variables are binaries, meaning it’s hard
to tell what’s what. Let’s tidy up these plots and add some labels.
Changing the panel labels without changing the underlying data is
slightly more complex than changing axis titles, so let’s look at how to
do that.
vsLabs <- c("0" = "V-shaped",
"1" = "Straight") # create a vector that matches the binary variables to their values
amLabs <- c("0" = "Automatic",
"1" = "Manual") # do the same for the am variable
ggplot(data = mtcars,
aes(x = hp,
y = mpg,
color = mpg)) +
geom_point(size = 3) +
facet_grid(cols = vars(vs),
rows = vars(am),
labeller = labeller(.cols = vsLabs, # use the labeller function to assign these labels to the rows and columns of the plot
.rows = amLabs)) -> facetPlot
facetPlot
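If you only need a quick way to tell the panels apart, a lighter alternative is the built-in labeller label_both, which prints the variable name next to each value. This is a sketch of that option rather than a replacement for the named vectors above, and the object name facetPlotBoth is made up for the example.

```r
library(ggplot2)

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = label_both) -> facetPlotBoth # strips read "vs: 0", "am: 1", etc.

facetPlotBoth
```

The strips still show the raw 0/1 codes, so the named-vector approach used for facetPlot remains the better choice for a final figure.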

Let’s tidy the rest of this plot up and then save it to file.
facetPlot +
scale_color_gradient(name = "Miles per\nUS Gallon", # \n starts a new line in the legend title
high = "purple",
low = "orange") + # change color of scale
theme_bw() +
xlab("Gross horsepower") + # add x axis title
ylab("Miles per US Gallon") + # add y axis title
theme(strip.background = element_rect(fill = "white")) -> facetPlotTidy # remove grey background from panel titles
facetPlotTidy

Now we can use the same methods as earlier to save our plot to either
a PDF or image file (or both!).
pdf("multi.facet.mtcars.plot.pdf", # file name to write to
height = 4, # plot height in inches
width = 6) # plot width in inches
facetPlotTidy # tell R which plot to write to file
dev.off() # this tells R that you're done creating a file
null device
1
ggsave(plot = facetPlotTidy, # specify plot
"multi.facet.mtcars.plot.tiff", # specify file name
height = 4, # plot height
width = 6, # plot width
units = c("in"), # specify which units to use for height and width
device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name
Plotting a time course
For many experiments, it’s important to be able to plot a time
course. Let’s load in some example colony count data from an experiment
growing four species of bacteria in both high and low iron conditions,
with time points at 0, 6, and 24 hours.
read.csv("cfu_counts_raw.csv") -> counts # read in counts
Let’s take a quick look at the format of the data we just loaded and
check that the format looks correct for plotting.
head(counts)
Our dataframe has columns as variables and rows as
observations so we’re good to go!
In order to plot a time course as a discrete variable that runs along
the x axis, we need to change the time variable from
numeric to a factor in the counts dataframe. Factors also let us
control the order in which groups are plotted. By default, ggplot2
plots numeric variables in ascending order, character variables in
alphabetical order, and factors in the order of their levels. So, we’ll
also set the iron level as a factor with explicit levels because I want
to plot the low iron condition before the high iron condition.
counts$time <- factor(counts$time)
counts$iron <- factor(counts$iron,
levels = c("Low iron","High iron"))
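The effect of setting levels can be checked on a small vector before touching the real data. The values in ironRaw below are made up for illustration.

```r
ironRaw <- c("Low iron", "High iron", "Low iron") # hypothetical values

defaultLevels <- levels(factor(ironRaw)) # no levels given: sorted alphabetically
defaultLevels

customLevels <- levels(factor(ironRaw,
                              levels = c("Low iron", "High iron"))) # explicit order is kept
customLevels

timeLevels <- levels(factor(c(0, 6, 24))) # numeric values are sorted before they become levels
timeLevels
```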
Now we can set our custom colors for the plot.
speciesCols <- c("Pseudomonas aeruginosa" = "#43ba8f",
"Staphylococcus aureus" = "#fec44f",
"Streptococcus sanguinis" = "#4292c6",
"Burkholderia orbicola" = "#d57bd4")
Let’s create a line plot of log10 CFU/mL per species over time, with
facets showing the high and low iron. We will plot a ribbon that
represents the standard deviation (sd), thin lines that
represent each replicate (tech.rep), and a thick line
representing the mean CFU/mL for each species (mean.cfu).
We’ll utilize the stat_summary function that we saw
earlier.
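One possible way to assemble the plot described above is sketched here. It does not use the real cfu_counts_raw.csv file: toyCounts is a simulated stand-in, and the column names species, tech.rep and log.cfu, the two species labels, and the helper meanSd are all assumptions made for this example, so adapt them to the headers in your own data.

```r
library(tidyverse)

# simulated stand-in for the counts dataframe (column names are assumptions)
set.seed(1)
expand_grid(species = c("Species A", "Species B"),
            iron = c("Low iron", "High iron"),
            time = c(0, 6, 24),
            tech.rep = 1:3) %>%
  mutate(log.cfu = 5 + 0.1 * time + rnorm(n(), sd = 0.3), # made-up growth curve
         time = factor(time),
         iron = factor(iron,
                       levels = c("Low iron", "High iron"))) -> toyCounts

# mean +/- 1 SD summary, redefined here so the sketch stands alone
meanSd <- function(x){
  tibble(y = mean(x),
         ymin = mean(x) - sd(x),
         ymax = mean(x) + sd(x))
}

ggplot(data = toyCounts,
       aes(x = time,
           y = log.cfu,
           color = species,
           fill = species,
           group = species)) + # group by species so lines connect across the discrete time points
  stat_summary(geom = "ribbon", # ribbon showing mean +/- 1 SD
               fun.data = meanSd,
               alpha = 0.2,
               color = NA) +
  geom_line(aes(group = interaction(species, tech.rep)), # one thin line per replicate
            linewidth = 0.3,
            alpha = 0.6) +
  stat_summary(geom = "line", # thick line showing the mean
               fun = mean,
               linewidth = 1.2) +
  facet_wrap(~ iron) -> toyTimeCourse # one panel per iron condition

toyTimeCourse
```

The group aesthetic is the important design choice here: with a factor on the x axis, ggplot2 would otherwise treat every time point as its own group and could not draw a connected line or ribbon.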

Activities
Green 1
Create a scatter plot using the columns Sepal.Length (x) and
Sepal.Width (y) from the iris dataset.
Green 2
Make a plot where all the points are green, and the line is colored
by the species of iris.
Blue 1
Make a plot that includes regression lines for individual species as
well as the overall data.

---
title: "Introduction to ggplot"
author: "Yasmin Hilliam, PhD"
date: "2025-07-05"
output: html_notebook
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      collapse = TRUE)
```

Whilst base `R` plots are quick and useful for examining our data, they don't always offer the flexibility and attractive customization options that we'd like for a presentation or manuscript. This is where `ggplot2` comes in.

This session will teach you the basics of using `ggplot2` to visualize data in R. ggplot was developed in 2005 by Hadley Wickham as an open source data visualization package for R. With `ggplot2`, you can create plots that range from simple scatter diagrams to complex custom plots that are (almost) completely customizable.

Resources:

<https://towardsdatascience.com/guide-to-data-visualization-with-ggplot2-in-a-hour-634c7e3bc9dd>

<https://ggplot2.tidyverse.org/reference/>

[https://www.youtube.com/\@Riffomonas](https://www.youtube.com/@Riffomonas){.uri}

------------------------------------------------------------------------

#### Install necessary packages

`ggplot2` is included in the `tidyverse` package, so installing and loading `tidyverse` covers almost everything in this session. The code below also installs `ggtext`, which adds Markdown and HTML text rendering to `ggplot2` labels.

```{r, message = FALSE}
packages_to_install <- c("tidyverse","ggtext")

for (package in packages_to_install) {
  if (!(package %in% rownames(installed.packages()))) {
    install.packages(package, dependencies = TRUE)
    print(paste("Installed package:", package))
  } else {
    print(paste(package, "is already installed"))
  }
}

library(tidyverse)
library(ggtext)
```

#### Understanding ggplot syntax

There are three fundamental elements that go into constructing a plot with `ggplot2`:

**Data** - dataframe to be plotted `data = dataframe`

**Aesthetics** - maps variables to elements of the plot (i.e. x axis, y axis, color scheme, etc.) `mapping = aes()`

**Geometry/Layers** - visual elements used for the data `+ geom_function()`

The typical input code for ggplot will usually look something like this:

```{r, eval = FALSE}
ggplot(data = df, # input data
       mapping = aes(x = var1, # input mapping aesthetics
                     y = var2,
                     color = var3)) +
  geom_point() # add plotting layer
```

#### Setting up data for success

One of the most important parts of getting ready to plot data in `R` is ensuring that your data are "tidy". When passing instructions to `ggplot2`, the program interprets dataframes in a fixed way:

**columns** are variables

**rows** are observations

Let's examine a dataframe to better understand how `ggplot2` interprets data. Here, we will be using the Iris dataset. The Iris dataset is built-in to `R` and was introduced by British statistician and biologist Ronald A. Fisher in 1936. The measurements themselves were collected by botanist Edgar Anderson to study the variation in iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

```{r, message = FALSE}
head(iris) # view the first six rows of the dataframe
```

Looking at the first rows of this dataframe we can see that each **variable** is contained in a column and each row is an **observation**. This means that if you have replicate measurements (as in this dataset, there are multiple measurements of each variable per species) you will need to have a row *per replicate* rather than storing the replicate data in columns.
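If your replicates currently sit in separate columns, one way to reshape them is `pivot_longer()` from `tidyr` (loaded with `tidyverse`). The sketch below uses a small made-up table; `wideReps`, the `rep1` to `rep3` columns, and the values are purely illustrative assumptions.

```{r}
library(tidyverse) # already loaded above; repeated so this chunk runs on its own

# hypothetical "wide" data: one column per replicate
wideReps <- tibble(sample = c("A", "B"),
                   rep1 = c(1.2, 3.4),
                   rep2 = c(1.4, 3.1),
                   rep3 = c(1.1, 3.6))

wideReps %>%
  pivot_longer(cols = starts_with("rep"), # columns that hold replicate measurements
               names_to = "replicate",    # new column recording which replicate a value came from
               values_to = "measurement") -> longReps # new column holding the values

longReps # now one row per replicate, ready for ggplot2
```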

Other important notes about this dataset:

-   it consists of four numeric variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one categorical variable (Species). This structured format makes it easy to map variables to aesthetics in `ggplot2`.

-   the Iris dataset has a balanced class distribution. Each of the three species (setosa, versicolor, virginica) has an equal number of observations. This balance allows for fair visual comparisons and avoids potential biases that can arise from imbalanced datasets.

-   column names and features contain no spaces or "-". `R` doesn't usually like these.
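These points can be checked directly in the console. The `spaced` dataframe at the end is a made-up example, included only to show why a column name containing a space is awkward to work with.

```{r}
str(iris) # four numeric columns and one factor

speciesCounts <- table(iris$Species) # number of observations per species
speciesCounts

# hypothetical column name containing a space: it only works when wrapped in backticks
spaced <- data.frame(`petal length` = c(1.4, 4.7),
                     check.names = FALSE) # stop R from silently renaming the column
spaced$`petal length`
```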

#### Let's build a plot!

`ggplot2` builds plots in layers. You can start with a layer showing the raw data, then keep adding elements until you have the graph you want. Building up step by step makes it easier to close the gap between the plot in your head and the plot on screen.

```{r}
ggplot(data = iris, # use 'data' argument to tell ggplot which dataframe we want to plot from
       mapping = aes(x = Petal.Length, # mapping determines which variables are assigned to plot elements
                     y = Petal.Width)) -> basicPlot

basicPlot
```

We have told `ggplot2` which variables we want to plot on the x and y axes but we have not told `ggplot2` which geometric elements (i.e. geoms) to use to construct the plot, so all we have are the axes and a blank plot.

#### What are geoms?

Geoms are the geometric objects (e.g. lines, bars, etc.) that determine how observations are rendered. Layering elements in a plot usually starts with adding geoms. Let's add `geom_point()` to our basic plot to create a scatter plot:

```{r}
ggplot(data = iris,
       aes(x = Petal.Length, # it is very common to see 'mapping =' omitted from the code - ggplot will accept either
           y = Petal.Width)) + # use a + to add elements to your plot
  geom_point() -> basicScatter

basicScatter
```

Another way you can add layers to a plot is by simply adding them to the end of the object that we assigned our first plot to. It is very common to see this in online guides and forums (such as Stack Overflow) where you might look for help with `R` coding:

```{r}
basicPlot +
  geom_point()
```

Although this generates the same output, I would generally avoid making your plots this way - if you end up with something that isn't quite working as expected I find it can be easier to fix if all your code is laid out in front of you, rather than having to revisit each individual step in the process of making your plot.

Now we'll start to add some more elements to our mapping aesthetics to better illustrate our data.

------------------------------------------------------------------------

**IMPORTANT NOTE**

The initial mapping that you specify in the `ggplot()` call (i.e. axes, color, size, etc.) is by default used globally for the plot and is carried over to any geoms you add in the following code. Each geom *can* also have its own separate mapping aesthetics, which can allow you to create more complex plots. If you ever run into issues where a geom is not behaving as you would expect, take a look back through your code and check where your aesthetics were assigned, and how they apply to the geoms you are trying to layer.

------------------------------------------------------------------------

Let's color our points by Species:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + # tell ggplot that color is determined by Species variable
  geom_point() -> colorScatter
          
colorScatter
```

Now we can add a simple regression using `geom_smooth()` and demonstrate how changing global vs. specific aesthetics affects geoms. Some geoms have specialized arguments that control how they work. In this case, `geom_smooth()` lets us tell it which method to use to generate the curve that it will plot. We will opt for `lm`, which fits a linear model; the shaded band around each line is the confidence interval that `geom_smooth()` draws by default.

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + 
  geom_point() +
  geom_smooth(method = lm) -> colorScatterlm
          
colorScatterlm
```

This creates three separate curves that map to the points from each Species by color, as this is what is specified in the global aesthetics. Let's change that and plot a curve that spans all points:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + # remove color from global aesthetics
  geom_point(aes(color = Species)) + # set geom_point aesthetics - this will only color points
  geom_smooth(method = lm) -> colorScatterlm2

colorScatterlm2
```

Now we can see that the points are still colored by Species, but the regression is not.

Let's play with some more aesthetics:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + 
  geom_point(aes(color = "blue")) + 
  geom_smooth(method = lm) -> basicPlotBlue

basicPlotBlue
```

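Notice that even though we set the color aesthetic of the points to "blue", the points are not blue. Anything inside `aes()` is treated as data, so `ggplot2` reads "blue" as a single category, gives it the first color of the default palette, and adds a legend entry for it. If we want all the points to share one color (or shape, or size), that value is *not* set through aesthetics, as it does not depend on a variable.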

Let's make our points blue and change their shape:

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(color = "blue", # note that these options are set outside aes()
             size = 5,
             shape = 1) -> openCircleBlue

openCircleBlue

# shapes are defined by a numerical value
# available shapes can be viewed at https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/ or by using ggpubr::show_point_shapes()
```

There are other ways we can control how each layer is rendered. Let's start with controlling "scales". Scales allow us to edit specific elements of the aesthetics and are named in a uniform manner that describes how they act and what they affect. The names are made up of three pieces separated by "\_":

-   `scale`
-   the name of the aesthetic (e.g. `color`, `shape`, `size`, etc.)
-   the name of the scale (e.g. `manual`, `continuous`, `discrete`, etc.)

Within the scales there are several options we can edit:

-   `name =` this controls what the variable is called within the plot or legend
-   `values =` this allows us to manually input the values (i.e. colors or shapes) used in the plot
-   `labels =` this controls the data labels (i.e. species names) in the plot or legend

Let's use `scale_color_manual` to manually select some colors for our plot.

(There are set named colors that can be used in `R` <https://stat.columbia.edu/~tzheng/files/Rcolor.pdf> but you can also use hex codes. Make sure to use color blind friendly color schemes for figures you plan to present or publish!)

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width,
                     color = Species)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_color_manual(name = "Iris species",
                     values = c("setosa" = "pink",
                                "versicolor" = "plum",
                                "virginica"="seagreen3"),
                     labels = c("Iris setosa",
                                "Iris versicolor",
                                "Iris virginica")) -> multiColor

multiColor
```

Now we have some different colors and data labels in our plot of the Iris data. The `name` and `labels` options in the `scale` are useful for changing how the data are labeled in your plot *without* needing to manipulate the raw data.
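
The same naming pattern applies to other aesthetics. As a minimal sketch, here is `scale_shape_manual()` used with the same `name`, `values`, and `labels` options; the shape numbers chosen and the object name `shapePlot` are arbitrary choices for illustration.

```{r}
library(ggplot2)

shapePlot <- ggplot(data = iris,
                    mapping = aes(x = Petal.Length,
                                  y = Petal.Width,
                                  shape = Species)) + # map shape instead of color
  geom_point() +
  scale_shape_manual(name = "Iris species",
                     values = c("setosa" = 16,      # filled circle
                                "versicolor" = 17,  # filled triangle
                                "virginica" = 15),  # filled square
                     labels = c("setosa" = "Iris setosa",
                                "versicolor" = "Iris versicolor",
                                "virginica" = "Iris virginica"))

shapePlot
```

Naming the entries in `labels` ties each label to a specific level, so the legend stays correct even if the order of the levels changes.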

We can also use continuous color scales to visually represent changes in values:

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Petal.Length)) +
  geom_smooth(method = lm) -> blueCont

blueCont
```

You can also use other variables within the dataframe to control the aesthetics of the plot.

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Sepal.Length, # color is dependent on sepal length
                 size = Sepal.Width)) + # point size is dependent on sepal width
  geom_smooth(method = lm,
              color = "black", # change the color of the curve
              se = FALSE) + # hide the confidence band around the curve
  scale_color_gradient(high = "purple",
                       low = "orange") # manually set the colors of the gradient scale
```

Obviously there is a little too much data now contained in this plot for it to be particularly useful, but it is a good example of how much data you can display and the different ways you can present it using `R`.
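
As an aside, you don't have to pick gradient colors by hand. `ggplot2` also includes the viridis color scales, which are designed to be readable for people with color blindness. A minimal sketch, where the object name `viridisPlot` is made up for this example:

```{r}
library(ggplot2)

viridisPlot <- ggplot(data = iris,
                      mapping = aes(x = Petal.Length,
                                    y = Petal.Width)) +
  geom_point(aes(color = Sepal.Length)) +
  scale_color_viridis_c(name = "Sepal length", # "_c" is the continuous version of the scale
                        option = "viridis")    # other palettes include "magma" and "plasma"

viridisPlot
```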

Let's tidy up our plot and make something that looks a little more "publication-ready":

First, let's assign our chosen color scale to a vector object so we can call the same colors for any future plots without needing to write out the code every time. This time I'm going to use a color blind friendly palette generated using this tool: <https://davidmathlogic.com/colorblind>.

```{r}
plotColors <- c("setosa" = "#648FFF",
                "versicolor" = "#DC267F",
                "virginica"="#FFB000")
```

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm,
              color = "black") +
  theme_bw() + # this is a built-in theme that removes the gray plot background
  scale_color_manual(values = plotColors) + # direct ggplot to our color vector
  ylab("Petal width (cm)") + # change y axis label - can also be done with scales
  labs(x = "Petal length (cm)",
       color = "Iris species") + # change legend title - can also be done with scales as previously
  ggtitle("Petal width by petal length per species") + # add plot title 
  theme(plot.title = element_text(hjust = 0.5)) -> multiColorTidy # center plot title

multiColorTidy
```

Perfect! Now we can write our plot to a PDF:

```{r}
pdf("multi.color.iris.plot.lm.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

multiColorTidy # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file
```

Or we can use `ggsave()`, a `ggplot2` function that can save a plot to many graphics file types (e.g. PDF, PNG, or TIFF):

```{r}
ggsave(filename = "multi.color.iris.plot.lm.tiff", # specify file name
       plot = multiColorTidy, # specify plot
       height = 4, # plot height
       width = 6, # plot width
       units = "in", # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name
```
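
For raster formats such as PNG you will usually also want to set the resolution. Here is a minimal sketch, assuming a throwaway plot (`examplePlot`) and a made-up file name written to R's temporary directory so that it doesn't clutter your project folder:

```{r}
library(ggplot2)

examplePlot <- ggplot(data = iris,
                      mapping = aes(x = Petal.Length,
                                    y = Petal.Width)) +
  geom_point()

outFile <- file.path(tempdir(), "example.iris.plot.png") # hypothetical file name

ggsave(filename = outFile, # the .png extension tells ggsave which file type to write
       plot = examplePlot,
       height = 4,
       width = 6,
       units = "in",
       dpi = 300) # resolution in dots per inch

file.exists(outFile) # check that the file was written
```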

#### Plotting group means and error bars

Another way that we may want to plot our data is by plotting both a group summary (such as the mean or median) *and* the individual data points. This can help people better visualize the spread of our data. This is easy enough with geoms like `geom_boxplot()` and `geom_violin()` that have group summaries built into their functionality.

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length,
           fill = Species)) + 
  geom_boxplot(outliers = FALSE) +
  geom_point()
```

You can see that `geom_boxplot()` has automatically drawn a box spanning the 25th to 75th percentiles, with a thick line at the median and whiskers that extend to the furthest value within 1.5 × the IQR of the box. Values beyond the whiskers are treated as outliers and are normally drawn as separate points; here we set `outliers = FALSE` because `geom_point()` already draws every observation. This is great for taking a quick summary look at your data.

But what if we want a bar plot of group means? `geom_bar()` counts observations and `geom_col()` plots values exactly as given, so neither calculates a group mean for us.

There are a couple of ways we could solve this issue using functions in the `tidyverse` package. First, we could create a summary table that contains grouped information for our data using `summarise`.

```{r}
iris %>%
  group_by(Species) %>% # group_by tells R which variable to use to group observations
  summarise(mean.Petal.Length = mean(Petal.Length), # add a column containing mean values per species
            standard.deviation = sd(Petal.Length)) -> irisSummary # add a column containing standard deviation

head(irisSummary)
```

We can use the `summarise` function to create a new dataframe that contains a mean and standard deviation for each species. We can write this to a new object and then use this for plotting by providing each `geom` with a different dataframe.

```{r}
ggplot() + # we do not want global mapping or data for this plot so none is put in the ggplot call
  geom_col(data = irisSummary, # set the dataframe for the columns
           aes(x = Species,
               y = mean.Petal.Length,
               fill = Species),
           alpha = 0.5) +
  geom_errorbar(data = irisSummary, # set the dataframe for the error bars
                aes(x = Species,
                    ymin = (mean.Petal.Length - standard.deviation), # set the minimum error bar value
                    ymax = (mean.Petal.Length + standard.deviation)), # set the maximum error bar value
                width = 0.2) +
  geom_jitter(data = iris, # set the dataframe for the points
              aes(x = Species,
                  y = Petal.Length,
                  color = Species),
              width = 0.2, # make the total spread of the points narrower
              shape = 1) # set the shape to open circle
```

Now we can see both the mean and individual values on our bar plot.

Another, more streamlined, way of doing this is using `stat_summary`, where we remove the need to create a separate dataframe by using functions within the `ggplot2` package.

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", # identify which geom we want
               fun.data = mean_se, # tell stat_summary which function to apply to summarise the data
               aes(fill = Species), # set aesthetics as normal
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean_se,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)
```

Voilà! We have *almost* the same plot as above but with a step removed. However, you may have noticed that we used the function `mean_se`, which calculates the mean and standard error for a vector of `y` values at each unique `x` value (*i.e.* the function receives a vector of values for Petal.Length for each Species), and most of the time we like to use standard deviation. `ggplot2` does include `mean_sdl()`, but it relies on the `Hmisc` package and by default spans two standard deviations either side of the mean, so a simple alternative is to create our own function.

```{r}
mean.sd <- function(x){ # x is the vector of y values for one group
  tibble(y = mean(x), # return a tibble (similar to a dataframe) containing the mean
         ymin = y - sd(x), # calculate the minimum value for the error bar
         ymax = y + sd(x)) # calculate the maximum value for the error bar
}
```

Now we can create our plot:

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", 
               fun.data = mean.sd, 
               aes(fill = Species),
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean.sd,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)
```
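
If you're curious what a `fun.data` function actually returns, you can call one directly on a plain vector. This minimal sketch runs `mean_se()` on the setosa petal lengths and rebuilds the same values by hand; the object names (`setosaLengths`, `byFunction`, `byHand`) are made up for this example.

```{r}
library(ggplot2)

# pull out the Petal.Length values for one species
setosaLengths <- iris$Petal.Length[iris$Species == "setosa"]

# mean_se() returns one row with y (the mean), ymin and ymax (mean -/+ standard error)
byFunction <- mean_se(setosaLengths)
byFunction

# the same three values calculated manually
standardError <- sd(setosaLengths) / sqrt(length(setosaLengths))
byHand <- data.frame(y = mean(setosaLengths),
                     ymin = mean(setosaLengths) - standardError,
                     ymax = mean(setosaLengths) + standardError)
byHand
```

Any function that takes a vector and returns `y`, `ymin`, and `ymax` columns in this way can be passed to `fun.data`.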

#### Facets

Faceting is a technique that allows us to separate data out into panels based on a variable in the dataframe. This is useful for visualizing complex data where it may be easier to see patterns when the data are separated.

There are two methods to create facets in a plot: `facet_wrap()` and `facet_grid()`. If you are only creating facets based on one variable (e.g. species), `facet_wrap()` is usually the simplest choice, but if you want to create facets based on two variables (e.g. species *and* time point) and lay them out as rows and columns, use `facet_grid()`.

Let's pull up another of `R`'s built-in datasets (mtcars) that will allow us to see both of these in action. mtcars is built from data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973--74 models).

```{r}
head(mtcars)
```

Let's look at mpg (Miles per US Gallon) plotted against hp (Gross horsepower).

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3)
```

Now let's use `facet_wrap()` to split these data up by vs (Engine shape, 0 = V, 1 = straight).

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_wrap(~ vs)
```
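
`facet_wrap()` also has a few layout options that are worth knowing about. As a minimal sketch, `ncol` controls how many columns the panels are arranged in and `scales` lets each panel have its own axis range; the object name `wrapStacked` is made up for this example.

```{r}
library(ggplot2)

wrapStacked <- ggplot(data = mtcars,
                      aes(x = hp,
                          y = mpg)) +
  geom_point(size = 3) +
  facet_wrap(~ vs,
             ncol = 1,          # stack the panels in a single column
             scales = "free_x") # let each panel choose its own x axis range

wrapStacked
```

Free scales can make patterns within each panel easier to see, but they make it harder to compare values between panels, so use them with care.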

Let's add another variable facet with `facet_grid()` and split the data by am (Transmission, 0 = automatic, 1 = manual) as well.

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs), # assign a variable to the column panels
             rows = vars(am)) # assign a variable to the row panels
```

We can see that there are different correlations between hp and mpg depending on the other qualities of the car. However, this plot is now difficult to read because both faceting variables are binary (coded 0/1), meaning it's hard to tell what's what. Let's tidy up these plots and add some labels.

Changing the panel labels without changing the underlying data is slightly more complex than changing axis titles, so let's look at how to do that.

```{r}
vsLabs <- c("0" = "V-shaped",
            "1" = "Straight") # create a vector that matches the binary variables to their values

amLabs <- c("0" = "Automatic",
            "1" = "Manual") # do the same for the am variable

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = labeller(.cols = vsLabs, # use the labeller function to assign these labels to the rows and columns of the plot
                                 .rows = amLabs)) -> facetPlot

facetPlot
```
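
If you only need a quick label while exploring your data, a lighter-weight option is the built-in `label_both` labeller, which prints the variable name alongside its value (e.g. "vs: 0"). A minimal sketch, where the object name `labelBothPlot` is made up for this example:

```{r}
library(ggplot2)

labelBothPlot <- ggplot(data = mtcars,
                        aes(x = hp,
                            y = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = label_both) # label panels as "variable: value"

labelBothPlot
```

This doesn't explain what 0 and 1 mean, so the custom labels above are still the better choice for a final figure.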

Let's tidy the rest of this plot up and then save it to file.

```{r}
facetPlot +
  scale_color_gradient(name = "Miles per\nUS Gallon", # \n starts a new line in the legend title
                       high = "purple",
                       low = "orange") + # change color of scale
  theme_bw() +
  xlab("Gross horsepower") + # add x axis title
  ylab("Miles per US Gallon") + # add y axis title
  theme(strip.background = element_rect(fill = "white")) -> facetPlotTidy # remove grey background from panel titles

facetPlotTidy
```

Now we can use the same methods as earlier to save our plot to either a PDF or image file (or both!).

```{r}
pdf("multi.facet.mtcars.plot.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

facetPlotTidy # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file
```

```{r}
ggsave(filename = "multi.facet.mtcars.plot.tiff", # specify file name
       plot = facetPlotTidy, # specify plot
       height = 4, # plot height
       width = 6, # plot width
       units = "in", # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name
```

#### Plotting a time course

For many experiments, it's important to be able to plot a time course. Let's load in some example colony count data from an experiment growing four species of bacteria in both high and low iron conditions, with time points at 0, 6, and 24 hours.

```{r}
read.csv("cfu_counts_raw.csv") -> counts # read in counts
```

Let's take a quick look at the format of the data we just loaded and check that the format looks correct for plotting.

```{r}
head(counts)
```

Our dataframe has columns as **variables** and rows as **observations** so we're good to go!

In order to plot a time course as a discrete variable that runs along the x axis, we need to change the `time` variable from numeric to a factor. Factors also let us control the order in which observations are plotted: by default, ggplot will plot numeric variables in ascending order, character variables in alphabetical order, and factors in the order of their levels. So, we'll also set the iron level as a factor with explicit levels, because we want to plot the low iron condition *before* the high iron condition.

```{r}
counts$time <- factor(counts$time)

counts$iron <- factor(counts$iron,
                      levels = c("Low iron","High iron"))
```
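
To see what setting the levels does, here is a minimal sketch using a small made-up vector (`ironToy`) rather than the real data:

```{r}
ironToy <- c("Low iron", "High iron", "Low iron", "High iron") # made-up values for illustration

# without explicit levels, the levels are sorted alphabetically
levels(factor(ironToy)) # "High iron" "Low iron"

# with explicit levels, we choose the order
levels(factor(ironToy,
              levels = c("Low iron", "High iron"))) # "Low iron" "High iron"

# numeric values are sorted numerically before being converted to levels
levels(factor(c(24, 0, 6))) # "0" "6" "24"
```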

Now we can set our custom colors for the plot.

```{r}
speciesCols <- c("Pseudomonas aeruginosa" = "#43ba8f",
            "Staphylococcus aureus" = "#fec44f",
            "Streptococcus sanguinis" = "#4292c6",
            "Burkholderia orbicola" = "#d57bd4")
```

Let's create a line plot of log10 CFU/mL per species over time, with facets showing the high and low iron conditions. We will plot a ribbon that represents the standard deviation, thin lines that represent each replicate (`tech.rep`), and a thick line representing the mean log10 CFU/mL for each species. Rather than calculating the mean and standard deviation beforehand, we'll utilize the `stat_summary` function and the `mean.sd` helper that we saw earlier.

```{r}
counts %>%
  mutate(log10.cfu = log10(cfu)) %>% # create a new column with log10 CFU values
  ggplot(aes(x = time,
             y = log10.cfu)) +
  stat_summary(geom = "ribbon",
               fun.data = mean.sd,
               aes(group = species, # group aesthetic specifies how the lines are joined together
                   fill = species),
               alpha = 0.5) +
  geom_line(aes(group = interaction(species,iron,tech.rep),
                color = species),
            linewidth = 0.1,
            alpha = 0.7) +
  stat_summary(geom = "line",
               fun.data = mean.sd,
               aes(color = species,
                   group = interaction(species,iron)),
               linewidth = 1) +
  scale_color_manual(name = "Species", # both color and fill must have the same name if we want to combine the legend
                     values = speciesCols) +
  scale_fill_manual(name = "Species",
                    values = speciesCols) +
  labs(x = "Time (h)", # set axis labels
       y = "Log<sub>10</sub> CFU/mL") +
  theme_bw() + # remove grey background
  facet_grid(cols = vars(iron)) + # facet plot by high/low iron
  theme(strip.background = element_rect(fill = "white", # remove grey background from facet titles
                                        color = "black"),
        legend.text = element_text(face = "italic"), # set legend font to italic
        legend.position = "inside", # move legend inside bounds of plot
        legend.position.inside = c(0.8,0.2), # use a vector to set x and y position (0 - 1)
        legend.background = element_rect(fill = "white", # set box around legend
                                         color = "black"),
        axis.title.y = element_markdown()) # allow axis title to read html code for subscript
```
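
The `group` aesthetic and `interaction()` do a lot of the work in the plot above, so here is a minimal sketch of what `interaction()` produces. The data frame `toyDesign` and its values are made up for illustration; only the column names mirror the ones used above.

```{r}
# every combination of two species, two iron levels, and two replicates (made-up design)
toyDesign <- expand.grid(species = c("Species A", "Species B"),
                         iron = c("Low iron", "High iron"),
                         tech.rep = 1:2)

# one group per species/iron combination - used for the thick mean lines
nlevels(interaction(toyDesign$species, toyDesign$iron)) # 4

# one group per species/iron/replicate combination - used for the thin replicate lines
nlevels(interaction(toyDesign$species, toyDesign$iron, toyDesign$tech.rep)) # 8
```

Each level of the interaction becomes its own line, which is why the replicate lines need `tech.rep` in the grouping and the mean lines do not.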

### Activities

#### Green 1

Create a scatter plot using the columns Sepal.Length (x) and Sepal.Width (y) from the iris dataset.

```{r}

```

#### Green 2

Make a plot where all the points are green, and the line is colored by the species of iris.

```{r}

```

#### Blue 1

Make a plot that includes regression lines for individual species as well as the overall data.

```{r}

```



```{r}
iris %>%
  pivot_longer(cols = 1:4,
               names_to = "measurement",
               values_to = "cm") %>% # iris measurements are recorded in centimeters
  group_by(Species,measurement) %>%
  summarise(mean = mean(cm)) %>%
  ggplot(aes(x = measurement,
             y = mean,
             fill = Species)) +
  geom_col(position = "dodge")
```

